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Abstract 

Background: Small peptides encoded as one- or two-exon genes in plants have recently been shown to affect 
multiple aspects of plant development, reproduction and defense responses. However, popular similarity search tools 
and gene prediction techniques generally fail to identify most members belonging to this class of genes. This is 
largely due to the high sequence divergence among family members and the limited availability of experimentally 
verified small peptides to use as training sets for homology search and ab initio prediction. Consequently, there is an 
urgent need for both experimental and computational studies in order to further advance the accurate prediction of 
small peptides. 

Results: We present here a homology-based gene prediction program to accurately predict small peptides at the 
genome level. Given a high-quality profile alignment, SPADA identifies and annotates nearly all family members in 
tested genomes with better performance than all general-purpose gene prediction programs surveyed. We find 
numerous mis-annotations in the cunentArabidopsis thaliana and Medicago truncatula genome databases using 
SPADA, most of which have RNA-Seq expression support. We also show that SPADA works well on other classes of 
small secreted peptides in plants (e.g., self-incompatibility protein homologues) as well as non-secreted peptides 
outside the plant kingdom (e.g., the alpha-amanitin toxin gene family in the mushroom, Amanita bisporigera). 

Conclusions: SPADA is a free software tool that accurately identifies and predicts the gene structure for short 
peptides with one or two exons. SPADA is able to incorporate information from profile alignments into the model 
prediction process and makes use of it to score different candidate models. SPADA achieves high sensitivity and 
specificity in predicting small plant peptides such as the cysteine-rich peptide families. A systematic application of 
SPADA to other classes of small peptides by research communities will greatly improve the genome annotation of 
different protein families in public genome databases. 
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Background 

A major challenge in translating new genome sequences 
into useful community resources is the accurate annota- 
tion of genes and other functionally-relevant features [1]. 
While there have been clear improvements in gene pre- 
diction algorithms [2], accurate prediction of small one 
and two-exon genes remains stubbornly problematic [3]. 
False-positive signals arising from the poor specificity 
of promoter motifs and other commonly-used signals 
employed by general purpose gene-finding algorithms are 
widespread [4-7]. To address the flood of false-positive 
signals for small genes, many annotators filter out small- 
gene predictions lacking direct experimental expression 
evidence, resulting in a major problem of false negatives 
[3,4]. 

We propose here an alternative and complemen- 
tary strategy for genome-wide annotation - a strategy 
that has as its strength predicting the small one- 
and two-exon genes that all-purpose gene-finding algo- 
rithms often fail to predict accurately. Our approach 
focuses on finding all related paralogous genes within 
a target gene family and then using signals from the 
corresponding multiple sequence alignment to aid in 
refining the model predictions. We have implemented this 
approach in an open-source and freely available appli- 
cation called SPADA (Small Peptide Alignment Discov- 
ery Application). SPADA can be used directly with a 
users own protein family alignments or with a com- 
prehensive set of protein family alignments from pub- 
lic sources such as Pfam [8], InterPro [9] or PROSITE 
[10], enabling the exhaustive discovery of essentially all 
members of the input families within a given genome 
sequence. Because these public resources continue to 
expand and include new and novel protein families, 
SPADAs ability to comprehensively identify arbitrarily 
large families of small peptides in genomes will steadily 
grow. 

Here we describe the conceptual basis of SPADA 
and go onto test its performance with selected fam- 
ilies of notoriously difficult genes to annotate prop- 
erly - specifically, plant Cysteine-Rich Peptides (CRPs) 
in two model plant species {Arabidopsis thaliana and 
Medicago truncatula), the S -Protein homologue (SPH) 
family in A, thaliana, and the alpha-amanitin toxin 
gene family in the mushroom Amanita bisporigera. In 
the case of CRPs, we examine the accuracy and recall 
compared to published composite test/training sets for 
these species based on previous semi-manual cura- 
tion and subsequent experimental expression validation 
[11-13]. We also compare SPADAs performance against 
a range of commonly-used generic gene-prediction algo- 
rithms [14-17], providing evidence of SPADAs advan- 
tage in identifying these challenging classes of small 
peptides. 



Method 

SPADA is a computational pipeline that, when provided 
with a multiple sequence alignment for a gene/protein 
family of interest, identifies all members of this fam- 
ily in a target genome. Technically, SPADAs pipeline is 
a general homology-based gene finding program with 
specifically enhanced power to detect and annotate small 
peptides with one or two exons. Unlike general-purpose 
gene prediction programs such as Fgenesh [18], SPADA 
works on an entire gene family at one time - with 
the goal of finding all family members in the genome. 
Unlike other homology-based gene predictors such as 
Genewise [19] and Exonerate [20] that map a single pro- 
tein sequence to the target genome, SPADA performs 
a similarity search using a profile alignment and iden- 
tifies all homologs of the family. In addition, SPADA 
provides automated access to both similarity search tools 
(e.g., BLAST [21] and HMMER [22]) and ab initio gene 
predictors (e.g., Augustus), significantly improving the 
annotation efficiency of multi-member gene families. As 
shown in Figure 1, SPADA consists of four consecutive 
components: 

• Pre-Processing 

• Motif Mining 

• Model Prediction 

• Model Evaluation & Selection 

In SPADA, HMMER [22] is first used to identify hits 
in the target genome sequence (translated in six reading 
frames) as well as in the target proteome (if available) 
using a reasonably generous E-value (10). These hits are 
then tiled with regard to their genomic coordinates and 
merged into overlapping clusters. Finally, one best hit in 
each cluster is picked to generate a list of candidate hits. 

The pipeline then allows the user to run one or more 
processes to predict gene structures for these potential 
genes. By default, SPADA runs Augustus [17] using hit 
locations as clues for "CDS regions" (coding sequence). In 
parallel, SPADA runs a custom pipeline optimized for pre- 
dicting the exon boundaries of genes containing one or 
two exons using GeneWise [19], SplicePredictor [23] and 
custom Perl scripts. 

In the next step, all gene structure predictions are com- 
bined to make a raw calling set, with each hit having one 
or more gene structure predictions. SPADA uses multiple 
statistics to assess the confidence of each candidate gene 
model, including an alignment score (mean pairwise score 
with known members in the original family-specific mul- 
tiple sequence alignment), an HMM alignment score (sum 
of posterior probability scores in the Hmmsearch output 
file), the presence/absence of proper start/stop codons, as 
well as the SignalP D-score [24] in the case of secreted 
peptides. 
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Figure 1 The SPADA workflow. 



Finally, the best candidate gene model is picked for each 
hit and the resulting set is filtered using empirical cutoffs 
(hmmsearch E-value of 0.001) to remove false positives. 

Pre-processing 

Building family-specific multiple sequence alignments 

The original motivation for developing this pipeline was 
accurately identifying and predicting Cysteine-Rich Pep- 
tides (CRPs) in plant genomes. For this purpose, SPADA 
comes with a complete set of manually- c urate d pro- 
tein sequence alignments for plant Cysteine-Rich Pep- 
tide (CRP) families [11]. In 2007, Silverstein et al, built 
multiple sequence alignments for most plant CRP fam- 
ilies through iteratively scanning EST sequences from 
different plant species in TIGRs Gene Indices [11]. 
These alignments were re-aligned here using ClustalO 
[25] and trimmed using trimAl [26] to remove spu- 
rious sequences and poorly aligned positions. Finally, 
a profile Hidden Markov Model (HMM) was built for 
each CRP family using "hmmbuild" in the HMMER 
package [27]. 

As a general homology-based gene finding program, 
SPADA has been designed to work with any set of pro- 
tein families. Users can start with a list of amino acid 
sequence alignments of their own interest, run the script 
"build_profile.pr' to generate custom HMM profiles, and 
initiate the pipeline using the new HMM(s). With this in 
mind, we have tested SPADAs performance on an addi- 
tional protein family as a proof of concept (see Results 



section) and assessed its applicability to secreted protein 
families other than CRPs. 

Processing genome sequence and annotation 

In SPADA, genome FASTA sequences are translated in 
all six reading frames to amino acid sequences and then 
Open Reading Frames (ORFs) are extracted by breaking 
up these long amino acid sequences using stop codons. 
Here an ORF is defined as a segment of amino acid 
sequence with at least 15 residues and uninterrupted by 
stop codons. Extracting ORFs from the original trans- 
lated genomic sequence reduces the target database size 
for the subsequent motif mining step and improves sen- 
sitivity. Using ORFs also ensures that no protein-coding 
exon spans stop codons in the middle of a sequence and 
that each exon will have a reasonable length. In theory, 
all protein-coding exons should locate within these ORFs, 
which will be discovered in the next motif-mining step. 
However, the exact exon boundaries are still unclear at this 
point of the search procedure. 

If a gene annotation file in General Feature Format ver- 
sion 3.0 (GFF3) is available, SPADA can also read and 
process it, extracting the amino acid sequences of existing 
annotations and passing them onto the next motif mining 
step. In doing so, exon boundaries can be better refined, 
further improving the accuracy in the model prediction 
step. 

Hard-masking of genome sequences is recommended 
(replacing repetitive sequences with 'N s) before running 



Zhou etal. BMC Bioinformatics 2013, 14:335 
http://www.biomedcentral.eom/1 471 -2 1 05/1 4/335 



Page 4 of 16 



the pipeline. Some plant species have very large genomes 
with highly repetitive content (e.g., Maize [28]). By hard- 
masking the genome sequence, the target database size in 
the motif-mining step is effectively reduced, significantly 
improving the search sensitivity of the entire pipeline. 
However, if many family members locate in repeat-rich 
genomic regions (such as the fungi effector families [29]), 
the unmasked genome version should be used. 

Motif mining 

In SPADA, profile HMMs are used to search against trans- 
lated genomic sequences (and known protein sequences, 
if available) using "hmmsearch" a component of the 
HMMER package (v3.0) [27]. This program finds signif- 
icant hits against a protein sequence database using one 
or more profiles as inputs. The output of the scan is a 
list of genomic intervals with significant sequence similar- 
ity to query profiles and amino acid sequences translated 
from these intervals. For single-exon genes, a contiguous 
stretch of amino acid sequence in the target databases 
will be discovered, roughly corresponding to the exon in 
the original genomic sequence. For genes containing two 
or more exons, partial amino acid sequence hits corre- 
sponding to different exons will be separated by introns 
(if they share a reading frame) or distributed in different 
target sequences (if in different reading frames). SPADA 
collects all these full and partial hits in translated protein 
sequences, recovers their original genomic coordinates, 
filters out low-significance hits (E- value lower than 0.1), 
selects the most significant hit for each genomic inter- 
val since multiple input profiles may hit the same region, 
and merges nearby partial hits. During this merging step, 
SPADA requires that each neighboring partial hit should 
hit a different segment (either upstream or downstream) 
in the input profile HMM. The merged genomic intervals 
(called "extended hits") roughly correspond to the mul- 
tiple exons in the underlying gene model - although the 
exact intron-exon boundaries and start/stop codon loca- 
tions are yet to be refined at this stage of the procedure. 

In parallel, SPADA searches against existing protein 
sequences (generated using the GFF3 annotation file), 
yielding a separate list of hits to the input profiles. 
These hits are also treated as partial hits, i.e., mapped 
to their original genomic coordinates and then used 
to build "extended hits". This "hmmsearch against pro- 
teome" step is considered complementary to the above- 
mentioned "hmmsearch against translated genome" step, 
since it improves prediction sensitivity by capturing other- 
wise non-significant partial hits in the translated genome 
search. 

Model prediction 

At this point, SPADA has generated a list of "extended 
hits" approximately corresponding to actual exon bound- 



aries. For each extended hit the surrounding genomic 
sequence is extracted. By default, 2500 bp upstream from 
the hits are extracted, since the first exon (containing 
the signal peptide) is usually separated from the sec- 
ond exon (with the mature peptide) by an intron up to 
1500 bp, as determined by manual curation and under- 
standing of plant genomes. At the other end, 1500 bp 
downstream from the hit boundaries are extracted, since 
the correct stop codon can typically be found within 
1000 bp downstream of the HMM hit [11]. SPADA next 
runs one or more components (selected by the user) in 
parallel to determine gene structure in this region. A 
total of five prediction components are currently sup- 
ported by the pipeline: Augustus [17], GeneWise [19], 
GlimmerHMM [14], GeneMark [15] and GenelD [16]. 
By default SPADA only runs two of these components 
(Augustus and GeneWise) since performance evaluation 
on a group of common plant peptides suggests that run- 
ning all five of them does not offer a significant extra 
gain compare to running just two of them (see Results & 
Discussion). 

The first component, which we denote "Augustus_ 
evidence", runs Augustus [17] in its "evidence mode". The 
genomic sequence is used as input along with a "hint 
file" providing the program instructions for which part(s) 
of the input sequence are known to be part(s) of the 
coding sequence. In other words, location information 
of extended hits is incorporated in the prediction pro- 
cess. Augustus will then try to complete the gene model 
by looking for start/stop codons and canonical donor- 
acceptor splice sites around the hits while preserving 
the open reading frame. The improvement in predic- 
tion accuracy and specificity by running Augustus in the 
"evidence mode" (as compared to the "Augustus de novo 
mode") is significant and will be discussed in Results & 
Discussion. 

In parallel, SPADA runs a custom pipeline specifically 
designed to identify and predict genes with one or two 
exon(s) and with a leading signal peptide. This compo- 
nent, which we denote "Genewise+SplicePredictor", first 
runs GeneWise [19] to align the extended hit sequence 
(translated to amino acid sequence) to genomic sequence 
and identifies compatible splice sites that preserve the hit 
ORFs. If GeneWise fails due to non-canonical splice sites, 
SPADA then runs SplicePredictor [23] to find all possi- 
ble donor/acceptor splice sites and extracts compatible 
ones, extending the ORFs to the nearest start codon and 
stop codon. In practice, this "Genewise+SplicePredictor" 
approach works well as a complement to the "Augus- 
tus_evidence" approach (see Results and Discussion). 

At this stage in the pipeline, SPADA reports all compat- 
ible gene models predicted by the two components. These 
candidate models are then passed on to the next step for 
evaluation in order to generate a best calling set. 
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Model evaluation & selection 

For each extended hit, SPADA then evaluates the under- 
lying candidate models using a number of measures 
and picks the most "confident" model for output. These 
evaluation statistics include the presence of start/stop 
codons at the beginning/end of the model, the presence 
of inframe stop codons, the SignalP score [24] in the case 
of a secreted gene family, and in particular, the Multiple 
Sequence Alignment (MSA) score and the "Hmmsearch 
Probability" (HmmProb) score, as described below. 

In theory, the correct gene model should encode an 
amino acid sequence that aligns to the original family- 
specific protein alignments better than any other candi- 
date models. To calculate the MSA score, SPADA aligns 
the amino acid sequence of the candidate model to 
the profile alignment using ClustalO "profile-to-profile" 
mode [25]. SPADA then scores all pairwise alignments 
using BLOSUM80 scoring matrix [30] and calculates a 
mean alignment score. The BLOSUM80 matrix is used 
instead of BLOSUM62 because the sequences that are 
being aligned tend to be fairly similar to each other, and 
a matrix with more conserved target frequencies such as 
BLOSUM80 should be more reasonable. Ideally, the can- 
didate model with the highest MSA score should be the 
most probable model. 

Nevertheless, the MSA score is not sufficient to pick 
the best model, since candidate models are sometimes too 
close to each other in sequence and the MSA scores may 
not vary appreciably among model alternatives. Therefore 
SPADA also calculates an "Hmmsearch Probability" Score 
for each candidate model. In theory, if hmmsearch is run 
using the original family HMM against all candidate mod- 
els, the most significant hit in the output should then be 
the best model. In practice, the probability score in the 
hmmsearch output serves as a better predictor than the E- 
value itself, especially when a model contains more than 
one hit domain. The MSA score and the HmmProb score 
are used to evaluate each candidate model. SPADA then 
picks the best candidate model that meets the following 
criteria: (1) it has a SignalP D-score of no less than 0.4 
(determined according to the software manual, this option 
could be turned off to allow prediction of non-secreted 
gene families); (2) it has proper start/stop codons and no 
premature stop codon; and (3) it has the highest (MSA 
score + HmmProb score). 

SPADA uses a relatively relaxed E-value cutoff in the 
motif mining step (e.g., 10 for running hmmsearch) in 
order to increase specificity. This also results in numerous 
false positive hits. These hits will generally not have valid 
candidate gene models built for them in the model pre- 
diction step, and thus would not make it into the ultimate 
output. However, SPADA does employ a final filtering 
step based on hmmsearch E-value to refine gene models 
that are retained. We evaluated the performance of the 



pipeline under different final E-value cutoffs (see Results 
and Discussion) and set the default cutoff to 0.001, which 
may be adjusted by the user to achieve customized search 
purposes. For all gene models passing the filter, SPADA 
outputs the sequences in FASTA format and gene coordi- 
nate information in GFF format. SPADA also generates for 
each gene family a multiple sequence alignment includ- 
ing all predicted models and the family-specific consensus 
sequence. If a gene annotation file has been passed to the 
pipeline, SPADA will also report the comparison results of 
predicted models with existing annotation (e.g., the num- 
ber of models with exactly the same exon boundaries, 
models with partial overlap, models in different reading 
frames, etc.). 

Performance evaluation 
Compilation of the test set 

We have compiled a test set of plant cysteine-rich peptides 
(CRPs) in two genomes: the model dicotyledon, Arabidop- 
sis thaliana, and the model legume, Medicago truncat- 
ula. In previous work, Silverstein et al have exhaustively 
searched and curated all 516 CRP families (CRPOOOO - 
CRP6250) in Arabidopsis [11]. A large number of CRPs 
have also been identified and curated in an early release 
of the Medicago genome sequence [11]. Recently, as a col- 
laborative effort with J. Craig Venter Institute (JCVI), we 
expanded this list of CRPs in M, truncatula. We manually 
inspected and curated 136 CRP families (CRPOOOO - 
CRP 1530, focusing specifically on the Defensin-Like pro- 
teins or DEFLs) in M. truncatula [31]. Finally, we now 
have a complete list of CRP members for Arabidopsis 
and Medicago (742 for Arabidopsis and 725 for Medicago, 
Additional file 1: File SI). 

We collected evidence from multiple sources to validate 
the expression of the models in the compiled test set. On 
the one hand, extensive RNA-Seq data were downloaded 
from NCBI Sequence Read Archive [32] for both Ara- 
bidopsis 2ind Medicago; on the other hand, we downloaded 
the AtMtDEFL microarray dateset [12,13] to find addi- 
tional support for expression of these gene families. The 
AtMtDEFL array include probe sets for 317 Arabidop- 
sis DEFLs, 15 Arabidopsis DEFL-related Genes (MEGs), 
and 684 Medicago DEFLs, plus additional marker genes. 
In total, 583 (78.6%) out of the 742 CRPs in the Ara- 
bidopsis test set and 657 (90.6%) out of the 725 Medicago 
CRPs receive support from either RNA-Seq (FPKM >1) or 
microarray data (Additional file 2: Table S3 and Additional 
file 3: Table S4). These carefully curated, high-quality CRP 
calls were then taken as our test set in evaluating the per- 
formance of different model prediction components in 
SPADA under different hmmsearch E-value cutoffs. All 
experimental procedures complied with the guidelines of 
the Institutional Biosafety Committee (IBC) at University 
of Minnesota (IBC code: 1301-30313H). 
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Evaluation procedure 

We tested a number of popular gene prediction pro- 
grams as SPADA model prediction components. In 
addition to the previously mentioned components, we 
also tested GenelD (vl.4.4) [16], GlimmerHMM (vS.O.l) 
[14], GeneMark (v3.9d) [15], and Augustus (v2.6.1, de 
novo mode). The "Augustus_evidence" differs from the 
"Augustus_de_novo" component simply by the inclusion 
of a "hint file" (with hit location information) fed to 
the program. We evaluated the pipeline performance 
running these components (individually or in combina- 
tion) based on our curated test dataset (see Results and 
Discussion), and decided to use the "Augustus_evidence" 
and "Genewise+SplicePredictor" approaches as default 
components in the SPADA model prediction step. 

Prediction performance was measured at two different 
levels: coding nucleotide sequence and exonic structure. 
At each level, we measured the sensitivity and specificity 
for each component. We first define the true positives (TP, 
number of coding nucleotides that are correctly predicted 
as coding), true negatives (TN, number of noncoding 
nucleotides that are correctly predicted as noncoding), 
false negatives (FN, number of coding nucleotides pre- 
dicted as noncoding) and false negatives (FN, number of 
noncoding sequences predicted coding). At the nucleotide 
level. Sensitivity (Sn) is then defined as the proportion 
of coding nucleotides that have been correctly predicted 
as coding (Sn= j'p^p^ )^ while Specificity (Sp) is the pro- 
portion of predicted coding nucleotides that are actually 
coding (Sp= jp^pp ) [33]. At the exon level, Sn is the pro- 
portion of actual exons in the input sequence that are cor- 
rectly predicted, while Sp is the proportion of all predicted 
exons that are correctly predicted [33]. Other measures 
such as Correlation Coefficient (CC) and Average Con- 
ditional Probability (ACP) were not evaluated since they 
require the calculation of TN nucleotides/exons, which 
are noncoding regions that are predicted as noncoding. 
Unlike a general gene-finding program that tries to predict 
all coding genes in a given sequence, SPADA focuses only 
on coding genes that are significantly similar to a given 
profile, while ignoring all other genes. Consequently "TN" 
statistics is not straightforward to evaluate in this context. 

Performance evaluation was done in both A, thaliana 
and M, truncatula. The extracted genomic sequences 
were used as input sequences. We evaluated the pipeline 
performance using each of the "GenelD", "Augustus_ 
de_novo", "GlimmerHMM", "GeneMark", "GeneWise+ 
SplicePredictor", "Augustus_evidence" component (indi- 
vidually), as well as "SPADA" (combination of "Gene- 
Wise+SplicePredictor" and "Augustus_evidence") and 
"All" (combination of all 6 individual components). All 
programs were installed and run locally on a GNU/Linux 
workstation. The appropriate parameter files, model files 
and training directories, if available, were used to run 



these programs in each species, otherwise the default 
parameter files (which are for Arabidopsis) were used. 
The output of these runs were parsed to derive a unique 
prediction for each test sequence. 

RNA-Seq and microarray processing, data visualization 

We mapped the RNA-Seq short reads (downloaded form 
NCBI SRA) to the reference using TopHat and summa- 
rized the results using Cufflinks [34]. Cufflinks is able to 
estimate the expression value at the level of transcripts. 
We used a cutoff of FPKM (Fragments Per Kilobase of 
exon model per Million mapped fragments) > 1 to deter- 
mine if a model (either in the test set or in the SPADA 
prediction set) is expressed. 

For the AtMtDEFL array, PMA (Present, Marginal and 
Absent) calls and normalized expression values of each 
probe set were obtained from the supplemental tables 
of two recent papers [12,13]. We mapped the probe 
sequences to the transcript models in the test set as well 
as SPADA prediction. In many cases the annotated gene 
boundaries are not complete and lack portions of the 3'- 
UTR that is prioritized in Affymetrix designs, and the 
probes designed in these regions would not be mapped. 
As a result, we require at least six probes in a probe 
set matching the target gene (with 23 or more identical 
nucleotides for each 25-mer oligo probe). Finally, the PMA 
calls of a probe set should be 'Present' in at least one 
tissue/treatment condition to indicate expression support 
for the transcript model it is mapped to. 

In order to visualize some of the novel SPADA predic- 
tions as compared to the original genome annotation, as 
well as the underlying RNA-Seq read mapping support, 
we loaded the data (genome sequence file, annotation 
GFF file, SPADA prediction GFF file, RNA-Seq mapping 
BAM file) into IGV (Integrative Genomics Viewer [35]), 
adjusted the width of each track, and made screenshots. 

Results 

Performance evaluation of SPADA on plant Cysteine Rich 
Peptide (CRP) families 

SPADA performance under different search E-value 
thresholds 

Using our manually-curated high-quality CRP test set 
from Arabidopsis and Medicago, we first evaluated the 
performance of SPADA under different search E-value 
thresholds. Generally speaking, with a loose E-value 
threshold (e.g., 0.1), SPADA is able to predict almost all 
true models (i.e., achieving high sensitivity) while making 
many false predictions (i.e., specificity is low) (Additional 
file 4: Figure SI). By setting the search threshold to a more 
stringent value, SPADA avoids making most of the false 
predictions, but also loses a small number of true mod- 
els. In an effort to optimize search sensitivity (the ability 
to detect all true gene models) and specificity (prevent 



Zhou etal. BMC Bioinformatics 2013, 14:335 
http://www.bionnedcentral.conn/l 471 -21 05/1 4/335 



Page 7 of 16 



detection of spurious false models; refer to the "Method / 
Performance evaluation / Evaluation procedure" section 
for a formal definition of Sensitivity and Specificity), we 
set the default search E-value threshold to 0.001. Users can 
also change the default E-value threshold to build custom 
searches (e.g., a very sensitive search using E-value cutoff 
of 1 to identify all potential hits). 

Performance comparison of different gene prediction 
components 

We then compared the performance of SPADA run- 
ning different model prediction components: GenelD, 
Augustus {"de novo" mode as well as "evidence" mode), 
GlimerHMM, GeneMark, GeneWise+SplicePredictor as 
well as "SPADA" (combination of "Augustus_evidence" 
and "GeneWise+SplicePredictor") and "All" (combination 
of all 6 individual components) (Figure 2, Additional 
file 4: Figure SI). The high specificities observed in all 
components are likely due to the model evaluation and 
selection step, where most false models are filtered. Pre- 
diction sensitivities, on the other hand, show substantial 
differences among components. In both genomes tested, 
"Augustus_evidence" and "GeneWise+SplicePredictor" 
gave the highest sensitivities among the six individual 
components. The default SPADA pipeline (denoted as 
"SPADA" in the figure) runs these two components and 
achieved even higher sensitivity. On the other hand, 
running all six individual components (denoted as "All" 
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Figure 2 Performance comparison of different gene prediction 
components. Search E-value threshold is set to 0.001 by default. 



in the figure) gives the highest sensitivity, suggesting 
that search accuracy can still be improved by includ- 
ing more heterogeneous prediction programs in the 
pipeline. However, the gain in sensitivity offered by 
running all six components is marginal compared to 
running just two of them ("Augustus_evidence" and 
"GeneWise+SplicePredictor"), suggesting that a plateau 
in search accuracy could soon be reached and adding 
more prediction programs in the pipeline may not help 
much. 

These results are expected as SPADA does not work 
as a general gene finding program but instead focuses 
on particular classes of genes with known profiles. Small 
genes are typically difficult to predict and often missed by 
genome annotation pipeline due to the intrinsic properties 
of many automatic gene finding algorithms [36]. In our 
test with GenelD, Augustus and GlimmerHMM against 
the Medicago genome, the Arabidopsis training matrix 
was used since a Medicago specific one is not yet available. 
This explains to a large extent the extremely low sensi- 
tivity performance for these three programs in Medicago, 
Search specificities were generally quite high and did not 
vary much among different programs or genomes tested, 
indicating the relatively stringent search E-value (0.001) in 
effect allows few false positives. 

Cysteine-rich peptides predicted by SPADA in Arabidopsis 
and Medicago 

Using the default search E-value threshold and model pre- 
diction components, SPADA predicts 745 CRPs in Ara- 
bidopsis and 1170 (747 for CRP0000-CRP1530) in Med- 
icago (Table 1, Additional file 5: Table SI, Additional file 6: 
Table S2, Additional file 7: Figure S2, Additional file 8: 
Figure S3 and Additional file 9: File S2). These numbers 
are generally consistent with our manually curated CRP 
test sets (742 iov Arabidopsis) and 725 for Medicago [11], 
with a sensitivity of 91%-93% and specificity of 85%-95% 
at the nucleotide level (Additional file 4: Figure SI). Mem- 
bers within a sub-class typically show a conserved signal 
peptide and cysteine configuration (Additional file 10: 
Figure S4 for example). We also checked the expression of 
these predictions using publicly available RNA-Seq data 
from NCBI: 570 (76.5%) out of the 745 Arabidopsis CRPs 
and 947 (80.9%) out of the 1170 Medicago CRPs receive 
either RNA-Seq or AtMtDEFL array expression support 
(Additional file 11: Table S5 and Additional file 12: Table 
S6). It should be noted that SPADA makes no attempt 
to predict pseudogenes as it filters out hits with in-frame 
stop codons. However, some pseudogenes with premature 
stop codons might still be predicted by SPADA as valid 
gene models if the in-frame part shows significant (though 
incomplete) similarity to the search HMM. This in part 
explains the higher number of SPADA predictions (747 for 
CRP0000-CRP1530) in Medicago than our test set (725) 
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Table 1 Cysteine-Rich Peptides (CRPs) predicted in A. thaliana and M. truncatula 







IK. thaliana 


M. truncatula 


r)pfpn<;in rplp^tpH 


CRPonnn-rRPo^fin etc 


56 


43 


1 CR/RFTI related 


rRPO^fin-CRPORI 0 etc 


162 


1 10 


SCR related 


CRPOR^O-CRPOHRO 


32 


5 


l\/lpf a| InrpirhnYvnentida^e inhihitnr 


rRPinn4-rRPin^n 


0 


1 


CCP related 


CRP1040-CRP1 120 


19 


4 


Nndiile C~\/'^teine-Ri^h nentide 


CRP1 1 30-CRP1 S30 


3 


583 


Rineninn related nrntein 

Ixlkj^l III IvJ I^IQL^V-J kJIV^L^II 1 


CRP1600-CRP160S 


0 


21 


|\|(-)\/p| farniK/ 

iN^^vtri lull Illy 


CRPl 6?0 CRP^ROO etc 

\„i \i 1 wz_w,\„i \i z_L^ww,cr Lv„, 


14 


1 5 


J\/ljc(-p||anpni |c; 

1 V 1 1 jk-Cri IQ 1 IdwL-tJ 


CRPI 640-rRPI 660 etc 


16 


48 


Ranid Alkaliniyatinn Fartnr 

IxOLJIV^ / VliVaill IIZ-OLIV^I 1 1 Qv-LV^I 


CRPI 700-rRP71 70 

V_l \l 1 / \J\J \-.\ \\ Z_ 1 Z_ \J 


38 


36 


Thinnin related 

1 1 iiv^i III 1 icriaLcrv_j 


rRP77oo-rRP76io 


66 


23 


Rnnt ran/late emhrvnnene^i^ 


CRP2820-CRP2850 


5 


7 


Antimicrobial peptide MBP-1 


CRP2900-CRP3000 


1 


2 


Bowman Birk inhibitor 


CRP3100-CRP3190 


0 


16 


Pollen Ole e 1 
1 wii^i 1 v^i^ ^ 1 


CRP3300-CRP3S10 


34 


44 


ECA1 gametogenesis related 


CRP3600-CRP3740 


124 


17 


Lipid transfer protein 


LKP3b00-LKP49D2 


1 27 


1 27 


2S Albumin 


CRP4970-CRP5080 


5 


3 


Glutenin/Giadin/Prolamin 


CRP5090-CRP5270 


0 


0 


Maternally-expressed gene/Ael 


CRP5300-CRP5520 


20 


2 


Proteinase inhibitor II 


CRP5545-CRP5600 


6 


2 


Chitinase/Hevein 


CRP5610-CRP5820 


10 


15 


Kunitz type inhibitor 


CRP6010-CRP6180 


7 


45 


Total 




745 


1170 



since pseudogenes were manually removed to obtain the 
test set. 

The default E-value threshold of 0.001 is a compromise 
between sensitivity and specificity that generally works 
well for both organisms. For the purpose of identifying 
all potential small coding genes, a search with high sensi- 
tivity should be performed since it allows the user to see 
all potential hits and then determine for him/herself the 
boundary between false predictions and true predictions 
based on search scores. The users can then set the cut- 
off threshold empirically and select genes for experimental 
verification on their own. Thus, we also report here two 
CRP prediction sets by running SPADA using E-value 
threshold of 1 (Additional file 13: Table S7 and Addi- 
tional file 14: Table S8). In practice, users are encouraged 
to change the default E-value threshold to build custom 
searches. 

According to the latest versions of genome annota- 
tion for Arabidopsis and Medicago, about 5% to 15% 
of SPADA predictions fall completely into intergenic 
regions (i.e., are un-annotated. Table 2). Through man- 
ual inspection of these models, we found that some of the 



unannotated models turn out to be ORFs with premature 
stop codons (i.e., pseudogenes), while others had quite 
significant hmmsearch E-value and complete ORFs. In 
addition, some predicted models receive expression sup- 
port from either existing EST sequence or RNA-Seq data. 
An example is shown in Figure 3 where the predicted CRP 
model is supported by RNA-Seq mapping, fits well in the 
family-specific alignment, but was missed by the genome 
annotation {Medicago genome annotation version 3.5) as 
well as by our test set. While such cases are infrequent 



Table 2 Novel CRP models identified by SPADA 
determined by manual inspection 





A. thaliana 


M. truncatula 


Total predictions 


745 


1170 


Number of unannotated predictions^ 


5 


125 


Number of novel models^ 


3 (60%) 


77 (62%) 



^An unannotated prediction is a gene model predicted by SPADA but missed by 
current genome annotation. 

^Novel models are unannotated predictions that are manually inspected to be 
true members of the family with evidence from family-specific alignment and/or 
RNA-Seq evidence. 
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Figure 3 A novel gene model predicted by SPADA is missed by the current Medicago annotation. A Medicago NCR (hi 001 .01) shown in IGV 
(above figure) and subgroup alignment of CRPl 180 (below figure, hi 001. 01 shaded). 




(e.g., only 10 in Arabidopsis), we speculate the specificity 
of SPADA is likely to be underestimated. 

We then performed manual inspection on these unan- 
notated CRP models and tried to determine whether the 
calls are truly bad predictions (e.g., pseudogenes with 
pre-mature stop codons) or valid members of the fam- 
ily missed by current genome annotation (criteria being 
that the predicted model fits well in the family-specific 
alignment and has either RNA-Seq or Affymetrix expres- 
sion support). The number of "novel" CRPs discovered 
in this fashion, is given in Table 2 (Additional file 15: 
File S3). SPADA was able to identify 77 novel CRPs in 
Medicago that were missed by current genome annotation 
pipeline. The actual number of new CRPs in Medicago 
will be even higher since we only evaluated a subset of all 
CRP groups (CRP0000-CRP1530). This result is not unex- 
pected given that the Medicago genome was released only 
recently and resources and efforts put into the genome 
annotation pipeline have been limited. On the other hand, 
only 3 novel CRPs were found in Arabidopsis, suggesting 
a relatively higher quality of gene calls in this extensively 
studied model organism. 

Through examination of the novel CRPs, we also noted 
that while some of the novel hits have a very significant 
hmmsearch E-value (e.g., 10~^^), most have moderate E- 
values (e.g., 10""^ — 10~^), suggesting that their sequence 
similarity to the input HMM profile is limited. While the 
input profile alignments were manually built and may not 



be exhaustive in capturing all groups of CRPs, we specu- 
late that some of these novel CRPs might form new clades 
that define novel profile alignments, separate from the 
original alignment. Consequently, a new round of genome 
scans using these novel profiles has the potential for cap- 
turing even more members that have been missed in the 
previous search. 

Case study: the S-Protein Homologue (SPH) family in 
Arabidopsis 

In addition to plant CRPs, SPADA is readily general- 
izable to other classes of putative secreted peptides by 
substituting an appropriate sets of HMMs in place of 
CRP HMMs. Here, we used the SPH peptides (S-Protein 
Homologue) [37] as an example. A seed alignment includ- 
ing 45 plant self-incompatibility protein SI sequences 
(PF05938) was obtained from the Pfam database. An 
HMM profile was built from this alignment and then used 
as input to scan the Arabidopsis genome (TAIRIO [38]) 
by running SPADA. SPADA predicted 92 SPH peptides in 
total (Additional file 16: Table S9 and Additional file 17: 
File S4). Forty-five (45) of these predictions are identical 
with TAIRIO annotation. Seventeen have minor discrep- 
ancies with TAIRIO gene models (coding regions all in 
the same reading frame but have a boundary conflict of 
less than 15 amino acids, probably resulting from differ- 
ent start codons or alternative splice sites). Nine are in 
major conflict with existing gene models (coding regions 
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in different reading frames or having serious boundary 
conflict). We also discovered 21 new SPHs not present 
in TAIRIO. Through manual inspection of gene models 
and sequence alignments with other family members, we 
found 3 out of the 15 major conflicts reflect an error 
in TAIRIO (Figure 4A gives an example), while 19 out 
of the 21 models absent from TAIRIO are true mem- 
bers of the SPH family, missed by the current genome 
annotation (Figure 4B shows an example). Thus, we 
demonstrate that SPADA accurately detects other classes 
of secreted peptides given a well-constructed profile 
alignment. 

Case study: a fungal cyclic peptide family in Amanita 
bisporigera 

In order to assess whether SPADA could be useful in 
searches for families of small non-secreted peptides out- 
side the plant kingdom, we also examined the fungal 
cyclic peptides of Amanita mushrooms [39]. This fam- 
ily includes the amatoxins and phallotoxins, such as 
Qf-amanitin and phalloidin, respectively, which are synthe- 
sized as proproteins of 34-35 amino acids. We began by 
creating a multiple sequence alignment via ClustalO of 
reported proproteins [39] and executed SPADA as usual 
with the signal peptide filter turned off, using Arabidop- 
sis as the training model for Augustus in searching the 
low-coverage genome contigs of Amanita bisporigera. As 
a negative control, we scanned the genome of Amanita 



thiersii, which is non-toxic and not known to produce this 
class of toxins. 

SPADA identified five new peptides in the incompleted. 
bisporigera genome with strong homology (2.4 x 10~^^ < 
E < 5.8 X 10~^) that fit well with the alignment of known 
proproteins (Figure 5). One additional hit ran off the 
end of the contig, producing the incomplete propeptide 
"MSDTNVMRLPFTTP". No additional predictions were 
made by SPADA with E < 0.01 beyond sequences that 
were already included in the original alignment. Fur- 
ther, SPADA did not identify any hits when scanning the 
genome of A. thiersii, as would be expected from that 
organisms non-toxic nature. 

Discussion 

Homology-based gene prediction 

Unlike general-purpose gene predicting programs, 
SPADA works as a family-based gene finder. The major 
difference between SPADA and general gene predicting 
programs is that it incorporates prior information from 
the family profile in the prediction process. SPADA takes 
advantage of generic gene prediction programs, but goes 
a step further by suggesting where to look for family 
members. Through scanning the target genome using 
pre-built family-specific alignments, SPADA identifies 
and builds "extented hits" that serve as the backbone of 
the underlying exonic structure. This location informa- 
tion greatly improves prediction accuracy, as shown by 
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Figure 4 SPADA detects mis-annotated and novel SPH peptides in TAIR1 0. (A) SPADA detects an SPH peptide (hOOl 8.02) that is 
mis-annotated in TAIRl 0; (B) SPADA detects a novel SPH peptide (hOOl 3.02) not present in TAIRl 0. Multiple sequence alignment of selected SPH 
peptides are shown below with hOOl 8.02 and hOOl 3.02 shaded. 
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Figure 5 Multiple sequence alignment of Amanita toxin proproteins. Sequences identified by SPADA are labeled as "hm****". All remaining 
sequences were obtained from Hallen etal. [39], and were included in the initial alignment used as input for SPADA. 



the different performances of "Augustus_de_novo" and 
"Augustus_evidence" components in Figure 2. Among the 
six individual predicting components, the four that do not 
require additional information and make de novo predic- 
tions all yield low sensitivities. The other two approaches, 
"Augustus_evidence" and "GeneWise+SplicePredictor", 
make use of the location information and are able to 
predict most of the true positives. 

Family-based gene prediction was first introduced in the 
AUGUSTUS package as the AUGUSTUS-PPX (Protein 
Profile extension) module [17]. Although AUGUSTUS- 
PPX was shown to be more sensitive and accurate in 
predicting long, multi-exon gene family members than the 
standard AUGUSTUS algorithm, its approach is not suit- 
able for small, divergent peptide families such as CRPs, 
SPHs or Amanita toxin-like peptides examined here. 
Rather than using the entire protein family alignment pro- 
file as input, AUGUSTUS-PPX makes use of conserved, 
ungapped blocks from the alignment to make a profile. 
This enables the algorithm to identify core match regions 
in the genome sequence which together act as a scaffold 
in the gene prediction. A modification of the standard 
AUGUSTUS gene-centric HMM is then used to fill in the 
pieces between scaffold elements with splice elements and 
other signals, ultimately emitting the most probable full 
gene structure. While this approach works well for fami- 
lies of typical genes with large numbers of conserved ele- 
ments, it completely breaks down when applied to small, 
divergent peptide families like the CRPs, as these fami- 
lies tend to contain no conserved, ungapped regions of 
appreciable size to seed the initial scaffold. Indeed, when 



we applied AUGUSTUS-PPX to the CRPs we observed no 
improvement over "Augustus_<ie_novo". 

Improving prediction accuracy by model evaluation 

The default SPADA pipeline (running the "Augustus_ 
evidence" and "GeneWise+SplicePredictor" components) 
achieves even higher sensitivity than the two individual 
components. This owes to the model evaluation & selec- 
tion step. For each HMM hit, SPADA collects all candidate 
gene models built by its model predicting components, 
and in the model evaluation step, picks a best candidate 
model based on multiple evaluation statistics. True fam- 
ily members will probably get a high-scoring gene model, 
while most false positive hits will have no qualifying or 
only low-scoring gene models built. High-scoring gene 
models that passed the filter are more likely to be true 
models since they are the ones that best fit the family- 
specific alignment. 

Pseudogenes and gene models without expression 
evidence may still have significant value 

SPADA identifies paralogous gene family members 
throughout the genome. Many of these predictions cur- 
rently lack expression evidence and some of the gene 
predictions have premature stop codons suggesting they 
may be psuedogenes. Nonetheless, it is important to iden- 
tify all gene family members, regardless of their expres- 
sion and pseudogene status, especially in evolutionarily 
dynamic gene families,. The semi-automated approach 
that inspired SPADAs development identified hundreds 
of defensin-like genes in Arabidopsis which, at the time. 
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had no expression evidence [40] . Later, these genes turned 
out to be highly specifically expressed in reproductive tis- 
sues not previously examined with earlier genome-wide 
expression approaches [41]. Moreover, one must also be 
careful not to discard pseudogene predictions that are 
highly similar to other family members. A gene that 
appears as a pseudogene in the reference sequenced acces- 
sion of a species may indeed be fully intact in other 
accessions, as observed among the defensin-like pollen- 
tube attractant, AtLUREs [42]. In their study, Takeuchi 
and Higashiyama observed half a dozen AtLUREs with 
disabling mutations in non-reference accessions, as well 
as putative functional and intact forms of AtLUREs 1.5 
and 1.6, which are pseudogenes in the reference Col-0 
genotype [42]. 

Improving SSP annotation in current plant SSP databases 

Previous work has sought to exhaustively identify small 
secreted peptides (SSP) in Arabidopsis [4], rice [7] and 
Populus deltoides [6]. These earlier studies only scanned 
short ORFs (25-250 amino acids) in translated genome 
sequence, though Pan et al [7] did include multiple-exon 
gene predictions from ab initio gene predition programs 
such as Fgenesh and Augustus. However, with a primary 
focus on detecting all small secreted peptides, these stud- 
ies did not utilize protein family information in the model 
building process since secreted peptides are so diverse. 
The Arabidopsis Unannotated Secreted Peptide Database 
(AUSPD) only contains one-exon ORE predictions, and 
thus mis-annonates most (if not all) two-exon secreted 
peptides (Additional file 18: Figure S5 for example). 
The OrysPSSP database (comparative Platform for Small 
Secreted Proteins from rice from rice and other plants) 
does contain multi-exon models predicted by Fgenesh 
(0.72%) and Augustus (1.16%) in addition to single-exon 
ORFs [7]. However, since no prior information is incorpo- 
rated into predictions by these ab initio gene predicting 
programs, multi-exon models in OrysPSSP are frequently 
in conflict with the true rice CRPs (Additional file 19: 
Figure S6 for example). As a result, while most single-exon 
peptides in Arabidopsis and rice are captured in AUSPD 
and OrysPSSP respectively, a large portion of the two- 
exon and multi-exon genes (such as CRP0000-CRP1530) 
are clearly under-represented in these two databases. 
SPADA, on the other hand, used additional gene struc- 
ture information obtained in the motif mining step and 
was able to correctly predict most of the CRP models 
(Additional file 20: Table SIO). 

Complementarity of SPADA to generic gene prediction 
programs 

SPADA is not designed to identify all genes in a genome. 
However, its applicability to new annotation projects 



steadily will increase due to the marked growth of pro- 
tein sequence family signatures and alignments. Inter- 
Pro release 43.0 contains 16,652 protein family signatures. 
In the last 3 years, the number of families characterized 
by InterPro has increased by 24%, compared with a 38% 
increase in the 3 years prior to that [9]. (Release 29.0 
from October 2010 had 13,382 family entries; Release 
16.1 from October 2007 had 9,729 entries.) It should be 
noted that SPADA is unlikely to perform well with genes 
that have large numbers of exons due to the combinatoric 
explosion of potential splice donor and acceptor pair com- 
binations to evaluate. For longer multi-exon gene families, 
AUGUSTUS-PPX should be used. Still, SPADA has been 
shown here to be extremely effective in predicting fami- 
lies of one- and two- exon genes often missed or excluded 
by standard gene prediction algorithms [3,4]. Hence, it 
is anticipated that gene annotation pipelines would be 
improved by routinely running SPADA to pick up small 
genes in addition to the standard generic gene prediction 
algorithms (e.g., Augustus) for larger genes. 

Impact of better gene prediction algorithms on plant 
genomics 

As sequencing costs have come down, there has been a 
commensurate expansion in the sequencing of multiple 
plant genomes within each species. Moreover, Genome 
Wide Association Studies (GWAS) are now routinely car- 
ried out in these populations. Gene annotation cannot be 
simply transferred across members of a species due to 
the myriad of SNPs and indels that alter gene structures. 
Gan etal estimated that gene structural changes occurred 
in more than 30% of genes among the 18 Arabidopsis 
accessions they resequenced and assembled [43]. Further, 
GWAS studies have repeatedly implicated unannotated 
intergenic regions as having the most significant associ- 
ation with important agronomic traits [44,45]. While it 
is likely that many of these GWAS peaks identify non- 
coding RNAs or regions in strong linkage disequlibrium 
with causative variants, we suspect that many of these 
sites may actually mark members of as yet unannotated 
families of small genes. Indeed, in our own GWA stud- 
ies [46], many peaks turned out to coincide with NCR 
or other CRP family members that prior to our intensive 
family-based annotation studies had been un-annotated 
in Medicago. 

Discovery of genes resembling Nodule-Cysteine-Rich (NCR) 
peptides in Arabidopsis 

In striking contrast to Arabidopsis, the Medicago genome 
harbors a huge number (583 versus 3) of Nodule 
Cysteine-Rich peptides (NCRs, CRP1130-CRP1530) - 
Defensin-Like proteins with nodule-specific expression 



Zhou etai. BMC Bioinformatics 2013, 14:335 
http://www.biomedcentral.eom/1 471 -21 05/1 4/335 



Page13of 16 



(Table 1). These NCRs are unique to Medicago (specif- 
ically, legumes in the Inverted Repeat-Lacking Clade) 
[47] and have recently been shown to play vital roles 
in the communication between Medicago and symbiotic 
rhizobia [48,49]. 

Surprisingly, three CRPs were found in the Arabidop- 
sis genome falling into the nodule-specific sub-families 
(CRP1130-CRP1530, or NCRs). Previously, NCRs were 
thought to be unique to Medicago and other IRLC 
legumes, playing a vital role in the legume-rhizobia sym- 
biotic interaction [50]. Looking closely at the sequence 
alignments (Additional file 21: Figure S7), these "Ara- 
bidopsis NCRs" have all the conserved cysteine residues 
in the expected configuration, while also exhibiting sub- 
stantial divergence from Medicago NCRs - and forming 
a separate Arabidopsis-specific clade. Furthermore, only 
one Arabidopsis NCR is predicted in each sub-class. It 
is possible, therefore, that these "Arabidopsis NCRs" are 
descendants from the most recent common ancestral 
genes that later evolved into Medicago NCRs. After the 
Kv^b'ido^siS'Medicago divergence, these ancient NCRs 
could have become increasingly divergent in the legume 
{Medicago) clade, eventually gaining new functions in 
nodule development and symbiosis, possibly through neo- 
functionalization, conferring a selective advantage and 
thus increasing rapidly in copy number through gene 
duplication. 

Limitations of the SPADA pipeline 

Because the model prediction step in the pipeline is not 
optimized for multi-exon gene models nor the extremely 
large introns present in animal genomes, we do not yet 
recommend SPADA to identify small peptides in ani- 
mals (especially mammals). Also, SPADA is not expected 
to work well with bacterial genomes due to the absence 
of introns in their gene models. However, we speculate 
this pipeline will work well with organisms such as yeast, 
oomycete and fungi, since they have similar gene struc- 
tures to plants [51,52]. In fact, it was recently found 
that oomycetes and fungi genomes encode large num- 
ber of secreted effectors as a result of the evolutionary 
"arms-race" between pathogen and host [53,54]. Poten- 
tially, SPADA will be useful in effector discovery in these 
pathogen genomes given that a growing number of infor- 
mative family alignments are becoming available. 

The SPADA pipeline is useful beyond secreted peptides 
and outside plants 

Although SPADA was initially designed to target secreted 
peptide families in plants, it can be used on non- 
secreted peptide families, especially in fungal systems. In 
the Results section, we tested SPADA using draft genome 



contigs of the mushroom A, bisporigera in search of 
additional members of a class of potent liver toxin pro- 
peptides characterized in earlier work [39]. Roughly 20 
pro-peptides belonging to this family had been cloned and 
sequenced, with only about a dozen present among the 
draft genome sequence contigs. SPADA identified 5 new 
family members with convincing alignments and signifi- 
cant E-values (2.4 x 10"^^ < E < 5.8 x 10"^). Three of 
these were in contigs long enough that extensive homol- 
ogy in the 3'-UTR region characteristic of the family 
could be observed. The contigs of the remaining two 
hits ended shortly after the coding sequence preventing 
3'-UTR homology characteristics from being confirmed. 
When the same input HMM constructed from the known 
20 pro-peptides [39] was used to scan against a related 
mushroom Amanita thiersii that is known not to pro- 
duce toxic peptides, SPADA did not identify any candidate 
genes with E <0.01. 

Conclusions 

SPADA is a homology-based gene prediction program 
to accurately identify and predict the gene structure for 
short peptides with one to a few exons. SPADA works 
well on small plant peptides such as the cysteine-rich 
peptide families. SPADA gives much more accurate and 
precise gene calls than traditional ab initio gene find- 
ing programs in tested genomes. Running SPADA on 
less well-annotated plant genomes (e.g., Medicago) reveals 
numerous mis-annotated and unannotated CRPs in the 
current genome annotation. Predictions made by SPADA 
constitute the most complete set of plant cysteine-rich 
peptides, and in this regard, will provide an invaluable 
resource for the research of small, secreted peptides in 
plants. The systematic application of SPADA to other 
classes of small peptides by communities will greatly 
improve the genome annotation of different protein fami- 
lies in public genome databases. 

Availability and requirements 

Project name: Small Peptide Alignment Discovery Appli- 
cation 

Project home page: https://github.com/orionzhou/ 
SPADA 

Operating system(s): Linux 
Programming language: Perl 

Other requirements: SPADA is a perl-based pipeline 

that internally runs HMMer, Augustus, Genewise, Spli- 

cePredictor and a number of custom scripts. Instructions 

for installation and running are available at https://github. 

com/orionzhou/SPADA/wiki. 

License: Apache License, Version 2.0 

Any restrictions to use by non-academics: None 
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and M. truncatula (in GFF3 format). 

Additional file 2: Table S3. Expression support of the Arabidopsis CRP 
test set. 

Additional file 3: Table S4. Expression support of the Medicago CRP test 
set. 

Additional file 4: Figure SI. Performance comparison of five gene 
prediction components under different search E-value thresholds. 

Additional file 5: Table SI . CRPs predicted by SPADA in A. thaliana using 
E-value threshold of 0.001. 

Additional file 6: Table S2. CRPs predicted by SPADA in M. truncatula 
using E-value threshold of 0.001. 

Additional file 7: Figure S2. Genome distribution of CRPs predicted in 
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Additional file 8: Figure S3. Genome distribution of CRPs predicted in 
Medicago truncatula. 

Additional file 9: File S2. CRP predictions made by SPADA in A. thaliana 
and M. truncatula using search E-value threshold of 0.001 (in GFF3 format). 

Additional file 10: Figure S4. Multiple sequence alignments of Medicago 
CRP sub-families CRPOOOO and CRP1400. 

Additional file 1 1 : Table S5. Expression support of the CRPs predicted by 
SPADA in A. thaliana. 

Additional file 12: Table S6. Expression support of the CRPs predicted by 
SPADA in M. truncatula. 

Additional file 1 3: Table S7. CRPs predicted by SPADA in A. thaliana 
using E-value threshold of 1 . 

Additional file 14: Table S8. CRPs predicted by SPADA in M. truncatula 
using E-value threshold of 1 . 

Additional file 1 5: File S3. Novel CRP predictions made by SPADA in A. 
thaliana and M. truncatula as determined by manual inspection (in GFF3 
format). 

Additional file 1 6: Table S9. SPH peptides predicted by SPADA in A. 
thaliana. 

Additional file 1 7: File S4. SPH predictions made by SPADA in 
Arabidopsis using search E-value threshold of 0.001 (in GFF3 format). 

Additional file 1 8: Figure S5. A typical Arabidopsis CRP mis-annonated 
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Additional file 1 9: Figure S6. A typical rice CRP mis-annonated in 
OrysPSSP. 
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Additional file 21: Figure S7. Sub-class alignments of three Arabidopsis 
NCRs with Medicago NCRs. In each alignment the first sequence comes 
from Arabidopsis and the rest all come from Medicago. 
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